Lexical ambiguities naturally arise in languages. We present Lamb, a
lexical analyzer that produces a lexical analysis graph describing all
the possible sequences of tokens that can be found within the input
string. Parsers can process such lexical analysis graphs and discard any
sequence of tokens that does not produce a valid syntactic sentence,
thereby performing, together with Lamb, a context-sensitive lexical
analysis of lexically ambiguous language specifications.

I. INTRODUCTION

A lexical analyzer, also called lexer or scanner, is a piece 
of software that processes an input string conforming to 
a language specification and produces a sequence of the 
tokens or terminal symbols found in it. The obtained 
sequence of tokens is then usually fed to a parser or syn- 
tactic analyzer as the next step of a data translation, 
compilation or interpretation procedure. 

Sometimes, lexical ambiguities show up in a language specification.
Lexical ambiguities occur when an input string simultaneously
corresponds to several token sequences [9].

The traditional way of choosing a sequence amongst potential
alternatives [6] involves assigning a unique priority to each token. As
a result, when the regular expressions associated with two different
tokens match the same fragment of the input string, only the one with
the higher priority is considered.

However, the language developer may want similar substrings to be
recognized as different sequences of tokens depending on their context.
This cannot be achieved with the unique priority approach.

Statistical lexical analyzers also exist. Although statistical models
may perform well in context-sensitive scenarios, they require intensive
training and, as token types are actually guessed, they do not formally
guarantee that the obtained token sequence will be what the developer
intended.

When it comes to programming languages, data specification languages,
or limited natural-language scenarios, the syntactic rules are clear as
to what should be accepted. The use of statistical models introduces an
unpredictable possibility of error during token recognition that would
render scanning and parsing theoretically and pragmatically infeasible.

Our proposal, Lamb (standing for Lexical AMBiguity), performs a lexical
analysis that efficiently captures all the possible sequences of tokens
and generates a lexical analysis graph that describes them all. A
subsequent parsing process discards any sequence of tokens that does not
provide a valid syntactic sentence conforming to the syntactic rule set
of the language specification. This solves the lexical ambiguity problem
with formal correctness.



Therefore, Lamb allows language developers to spec- 
ify more complex languages than traditional techniques. 
Token priorities are still supported but their usage is op- 
tional. Several tokens may be set to share the same pri- 
ority if the developer wants ambiguities involving them 
to be considered. 

As research in lexical analyzers sets the basis for the application of
parsers, it inherits their application fields: the compilation or
interpretation of source code written in programming languages, the
interpretation and integration of data in data mining applications [4],
and natural language processing.



II. BACKGROUND 

The IEEE POSIX P1003.2 standard describes the requirements of the lex
and yacc tools, which are a traditional lexical analyzer generator and a
traditional syntactic analyzer generator, respectively. Implementations
of these tools are typically used in conjunction:

• Lex generates a lexer that takes as input a set of token types,
associated regular expressions [12], and the string to be scanned; and
produces the sequence of tokens found in the string.

• Yacc generates a parser that takes as input the se- 
quence of tokens and a syntactic rule set; and pro- 
duces a parse tree. 

Regarding ambiguities, lex enforces the assignment of a unique priority
to each token: tokens are tried and matched in the very same order in
which they have been specified.

A lex-generated lexical analyzer has an order of efficiency of $O(n)$,
where $n$ is the input string length.

The lex specification in Figure 1 shows an example of implicitly
reserved words: the words "true", "false", "if", and "while" will not be
considered identifiers, because they will match the BOOLEAN, IF, or
WHILE tokens before reaching the regular expression for IDENTIFIER.
Therefore, it is not possible for lex to consider these words as
identifiers in some contexts, even if the syntactic rules make clear
whether occurrences of these words should be considered identifiers or
not.
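
The reserved-word effect can be reproduced with a minimal Python sketch. It assumes the conventional lex disambiguation rule (longest match first, then declaration order); the `RULES` table and the `lex_like_scan` function are illustrative names, not part of lex or of Lamb.

```python
import re

# Token rules in declaration order, mirroring Figure 1; earlier rules win ties.
RULES = [
    ("IF", re.compile(r"if")),
    ("WHILE", re.compile(r"while")),
    ("BOOLEAN", re.compile(r"true|false")),
    ("IDENTIFIER", re.compile(r"[_a-zA-Z]+")),
]
WHITESPACE = re.compile(r"\s+")  # assumption: whitespace separates tokens


def lex_like_scan(text):
    """Return (type, lexeme) pairs using longest match, then rule order."""
    tokens, pos = [], 0
    while pos < len(text):
        ws = WHITESPACE.match(text, pos)
        if ws:
            pos = ws.end()
            continue
        best = None
        for name, pattern in RULES:
            m = pattern.match(text, pos)
            # Only a strictly longer match replaces the current best, so on
            # equal length the rule declared first is kept.
            if m and (best is None or m.end() > best[1].end()):
                best = (name, m)
        if best is None:
            raise ValueError(f"no rule matches at position {pos}")
        tokens.append((best[0], best[1].group()))
        pos = best[1].end()
    return tokens


print(lex_like_scan("while if iffy"))
```

Whatever the surrounding context, "while" and "if" always come out as WHILE and IF here; only a longer word such as "iffy" reaches the IDENTIFIER rule.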

Statistical models such as Hidden Markov Models [13], Hierarchical
Hidden Markov Models [3], or Maximum Entropy Markov Models [1] consider
the existence of implicit relationships between words, symbols, or
characters that are close together in strings.

These models need intensive corpus-based training and 
they produce results with associated implicit probabili- 
ties. 

It should be noted that, even though they can per- 
form well in natural language processing, their training 
requirement is impractical for programming or data rep- 
resentation languages, especially when the syntactic rules 
provide all the needed context information to unequivo- 
cally identify tokens. Furthermore, the results are prone 
to interpretation errors that would render the analysis 
unusable. 

The semi-syntactic lexical analyzer proposed by Shyu brings some of the
context information found in the syntactic rule set to the
deterministic finite automaton that performs the lexical analysis.
Although this technique considers context information found in
syntactic rules, it is not able to capture syntactic ambiguities for
further consideration, since the minimal automaton needed to do so is a
non-deterministic finite automaton, which would have increased the
complexity of the algorithm. Therefore, if lexical ambiguities cause
syntactic ambiguities or, in other words, there are several syntactic
interpretations of the input string due to lexical ambiguities, a
Shyu-like lexer would be unable to find them.



III. LAMB 

In contrast to the aforementioned techniques, Lamb is 
able to recognize and capture lexical ambiguities. 

Our proposed algorithm takes as input the string to be scanned and a
list of tokens associated with their corresponding regular expressions.
It produces, as output, a lexical analysis graph, in which each token is
connected to its following and preceding tokens in the input sequence.

Our algorithm consists of two steps: the scanning step, 
which recognizes all the possible tokens in the input 
string; and the graph generation step, which computes 
the sets of preceding and following tokens for each token 
and builds the resulting lexical analysis graph. 



A. The Scanning Step 

The pseudocode for the scanning step is shown in Figure 2.

Our algorithm receives an input string called input and a list of
matchers called matcherlist. Each matcher consists of a regular
expression and its corresponding match method, a priority value, and a
next value.

The match method tries to perform a match given the 
input string and a starting position in it. 

The priority value specifies the matcher priority. The value 0 is
reserved for ignored patterns, which are patterns that represent text
that does not correspond to tokens. Priority values for relevant tokens
therefore start at 1, and the lower the value, the higher the priority.
If two tokens share the same priority value, the lexer will capture both
of them if they overlap due to lexical ambiguities. If two tokens have
distinct priority values and they start at the same position in the
input string, only the higher-priority token will be considered.

The next value specifies the position immediately before the next
string position at which the matcher will be tried. It defaults to -1,
so every matcher will be tried at position 0.

    if              return(IF);
    while           return(WHILE);
    true|false      return(BOOLEAN);
    [_a-zA-Z]+      return(IDENTIFIER);

Figure 1 Example lex specification with implicitly reserved
words ("true", "false", "if", and "while").


The prio variable represents the last priority that has been matched at
the current input position. Its value is -1 if no match has been made,
0 if an ignored element match has been made, and a higher value if any
token of that specific priority has been identified.

    for i in 0 .. input.length()-1:
      prio = -1
      for each matcher m in matcherlist:
        if (prio >= m.prio || prio == -1) &&
           (prio != 0 && m.next < i):
          match = m.match(input, i)
          if match != null:
            prio = m.prio
            if m.isignore == false:
              t = new token(
                    id = id,
                    type = m.type,
                    text = match,
                    start = i,
                    end = i+match.length()-1
                  )
              tokenlist.add(t)
              id++
            min = i+match.length()-1
            for each matcher n in matcherlist:
              if n.next <= min && n.next >= i:
                min = n.next
              if n.next > m.next:
                n.next = i+match.length()-1
            if i >= min:
              min = i
            m.next = min
            for each matcher n in matcherlist:
              if n.prio > m.prio:
                n.next = min

Figure 2 Pseudocode of the scanning step in our lexical analysis
algorithm.

    for i in tokenlist.size()-1 .. 0:
      t = tokenlist[i]
      for j in i+1 .. tokenlist.size()-1:
        tc = tokenlist[j]
        if (tc.start > t.end &&
            (tc.prevstart == tc.start ||
             (tc.prevstart < tc.start &&
              tc.prevstart < t.end))):
          t.addfollowing(tc)
          tc.addpreceding(t)
          tc.prevstart = min(t.start, tc.prevstart)

Figure 3 Pseudocode of the graph generation step in our lexical
analysis algorithm.


The min variable is computed in order to determine the 
next position the current matcher will be tried at, and 
its value is the minimum of either the ending position of 
the found token or the ending position of any tokens that 
end before it. 

This step of the algorithm has a theoretical order of efficiency of
$O(n^2 \cdot l)$, where $n$ is the input string length and $l$ is the
number of matchers in the lexer.
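
The effect of the scanning step can be illustrated with a simplified Python sketch. It is not the algorithm of Figure 2: instead of the per-matcher next bookkeeping, it swaps in a simpler technique that only tries matchers at positions where a previous match ended. The `scan_all` function, the `Token` tuple, the `Whitespace` ignored pattern, and the use of the token definitions from the comparison section are all assumptions made for illustration.

```python
import re
from collections import namedtuple

Token = namedtuple("Token", "type text start end")  # end is inclusive

# (type, regex, priority); priority 0 marks ignored patterns, and a lower
# positive value means a higher priority. All tokens share priority 1 here.
MATCHERS = [
    ("Integer", re.compile(r"(-|\+)?[0-9]+"), 1),
    ("Real", re.compile(r"(-|\+)?[0-9]+\.[0-9]+"), 1),
    ("Point", re.compile(r"\."), 1),
    ("Slash", re.compile(r"/"), 1),
    ("Ampersand", re.compile(r"&"), 1),
    ("Whitespace", re.compile(r"\s+"), 0),  # hypothetical ignored pattern
]


def scan_all(text, matchers=MATCHERS):
    """Collect every token that starts where some earlier match ended."""
    tokens = []
    reachable = {0}  # positions at which a token may start
    for i in range(len(text)):
        if i not in reachable:
            continue
        found = []
        for type_, pattern, prio in matchers:
            m = pattern.match(text, i)
            if m and m.end() > i:
                found.append((prio, type_, m))
        if not found:
            continue  # nothing matches here; this sketch just moves on
        best = min(prio for prio, _, _ in found)
        for prio, type_, m in found:
            if prio != best:
                continue  # a better priority matched at this position
            if prio != 0:
                tokens.append(Token(type_, m.group(), i, m.end() - 1))
            reachable.add(m.end())
    return tokens


for token in scan_all("&5.2& /25.20/"):
    print(token)
```

Because Integer and Real share a priority, both `5` and `5.2` are kept at position 1, and both `25` and `25.20` at position 7, while no token is started in the middle of `25`.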



B. The Graph Generation Step 

The algorithm pictured in Figure 3 goes through the identified token
list in reverse order and efficiently computes the sets of preceding
and following tokens for every token.

The sets of preceding and following tokens of a token $x$ are defined
in Equation 1, where $a$, $b$, and $c$ are tokens and $x_{start}$ and
$x_{end}$ are the starting and ending positions of the token $x$ in the
input string.

$$b \in \mathrm{FOLLOWING}(a),\ a \in \mathrm{PRECEDING}(b) \iff a_{end} < b_{start} \wedge \nexists c : c_{start} > a_{end} \wedge c_{end} < b_{start} \qquad (1)$$

The prevstart variable in the pseudocode avoids the need to iterate
through the token list to find out whether there is any token between
two particular tokens, because it represents the starting positions of
the preceding tokens already found for a given token.

After the following and preceding sets have been com- 
puted for every token, any token whose preceding set is 
empty is added to the start token set of the lexical anal- 
ysis graph. 

The graph generation step has a theoretical order of efficiency of
$O(t^2)$, where $t$ is the number of tokens found. As $t \le n \cdot l$,
the theoretical order of efficiency of this step is
$O(n^2 \cdot l^2)$.

Together, the scanning and graph generation steps have an order of
efficiency of $O(n^2 \cdot l^2)$.
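
The definition of the preceding and following sets can be checked with a small Python sketch that applies it literally, testing every candidate pair against every possible in-between token. This brute-force version is cubic in the number of tokens and does not use the prevstart optimization of Figure 3. The token positions are an assumed scan of the string "&5.2& /25.20/" from the comparison section with Real and Integer sharing priority, and `build_graph` and `token_sequences` are illustrative names.

```python
from collections import namedtuple

Token = namedtuple("Token", "type start end")  # end is inclusive

# Assumed scan result for "&5.2& /25.20/" (position 5 is ignored whitespace).
TOKENS = [
    Token("Ampersand", 0, 0), Token("Integer", 1, 1), Token("Real", 1, 3),
    Token("Point", 2, 2), Token("Integer", 3, 3), Token("Ampersand", 4, 4),
    Token("Slash", 6, 6), Token("Integer", 7, 8), Token("Real", 7, 11),
    Token("Point", 9, 9), Token("Integer", 10, 11), Token("Slash", 12, 12),
]


def build_graph(tokens):
    """Compute FOLLOWING and PRECEDING directly from the definition."""
    following = {a: [] for a in tokens}
    preceding = {b: [] for b in tokens}
    for a in tokens:
        for b in tokens:
            if a.end >= b.start:
                continue  # b must start after a ends
            # b follows a only if no token c fits entirely between them.
            blocked = any(c.start > a.end and c.end < b.start for c in tokens)
            if not blocked:
                following[a].append(b)
                preceding[b].append(a)
    start = [t for t in tokens if not preceding[t]]
    return following, preceding, start


def token_sequences(following, start):
    """Yield the token-type sequence of every path through the graph."""
    def walk(token, path):
        path = path + [token.type]
        if not following[token]:
            yield path
        for nxt in following[token]:
            yield from walk(nxt, path)
    for s in start:
        yield from walk(s, [])


following, preceding, start = build_graph(TOKENS)
for sequence in token_sequences(following, start):
    print(" ".join(sequence))
```

Under these assumptions the graph has a single start token and four paths, one per alternative token sequence of the input string.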



    do:
      flag = false
      for each rule r in rules:
        for each token t in tokenlist:
          matches = r.match(t)
          if matches.size != 0:
            for each match m in matches:
              if !tokenlist.contains(m):
                tokenlist.add(m)
                if m is start symbol:
                  start.add(m)
                flag = true
    while flag == true

Figure 4 Pseudocode of the proof of concept parser supporting
ambiguities.



IV. COMPARISON 

In order to perform a formal comparison of traditional techniques and
Lamb, we have implemented a simple (and inefficient) proof of concept
parser that supports ambiguities and allows a lexical analysis guided by
a syntactic rule set. This parser returns as many parse trees as can be
obtained by applying a set of syntactic rules to a lexical analysis
graph.

Its pseudocode is shown in Figure 4. It iteratively tries to match
every rule starting from every existing token and following any
possible token path, and it adds the newly found tokens to the list
until no new tokens are found in an iteration.

Given a language specification that describes the tokens listed in
Figure 10, the input string "&5.2& /25.20/" can correspond to the four
different lexical analysis alternatives enumerated in Figure 11,
depending on whether the sequences of digits separated by points are
considered real numbers or integer numbers separated by points.

The syntactic rules shown in Figure 12 illustrate a scenario of lexical
ambiguity sensitivity. Depending on the surrounding tokens, which may be
either Ampersand tokens or Slash tokens, the sequences of digits
separated by points should be considered either Real tokens or Integer
Point Integer token sequences. The expected result of analyzing the
input string "&5.2& /25.20/" is shown in Figure 5.

In order to resolve the ambiguities when using a lex-like lexer, the
developer can assign the Integer token a higher priority than the Real
token. In that case, the only valid interpretation would be the one
shown in Figure 6. The developer can also assign the Real token a higher
priority than the Integer token. In that case, the only valid
interpretation would be the one shown in Figure 7. Therefore, lex-like
lexers cannot produce the token sequence that is needed to parse strings
that belong to our language with lexical ambiguities.
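
Both priority assignments can be tried out with a short Python sketch. It follows the strict priority model described in this paper, in which the first token type that matches at a position wins (this differs from the longest-match rule that lex implementations conventionally apply). The `priority_scan` function, the treatment of whitespace, and the `INTENDED` sequence written out from the syntactic rules are assumptions made for illustration.

```python
import re

PATTERNS = {
    "Integer": re.compile(r"(-|\+)?[0-9]+"),
    "Real": re.compile(r"(-|\+)?[0-9]+\.[0-9]+"),
    "Point": re.compile(r"\."),
    "Slash": re.compile(r"/"),
    "Ampersand": re.compile(r"&"),
}

# Sequence required by the rules E ::= A B, A ::= Ampersand Real Ampersand,
# B ::= Slash Integer Point Integer Slash.
INTENDED = ["Ampersand", "Real", "Ampersand",
            "Slash", "Integer", "Point", "Integer", "Slash"]


def priority_scan(text, order):
    """Scan text trying token types in `order`; the first match wins."""
    types, pos = [], 0
    while pos < len(text):
        if text[pos].isspace():  # assumption: whitespace is ignored
            pos += 1
            continue
        for name in order:
            m = PATTERNS[name].match(text, pos)
            if m:
                types.append(name)
                pos = m.end()
                break
        else:
            raise ValueError(f"no token matches at position {pos}")
    return types


text = "&5.2& /25.20/"
integer_first = priority_scan(
    text, ["Integer", "Real", "Point", "Slash", "Ampersand"])
real_first = priority_scan(
    text, ["Real", "Integer", "Point", "Slash", "Ampersand"])
print(integer_first)
print(real_first)
print(integer_first == INTENDED, real_first == INTENDED)
```

With Integer first, every number is split into Integer Point Integer; with Real first, every number becomes a Real. Neither run yields the mixed sequence the syntactic rules require.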

Nonetheless, as Lamb is able to capture all the possible token
sequences in the form of a lexical analysis graph, as shown in Figure 8,
the later application of a parser supporting lexical ambiguities will
produce the only possible valid sentence, which, in turn, is based on
the only valid lexical analysis possible. Both of them are shown in
Figure 9.

Figure 5 Intended lexical analysis.

Figure 6 Lexical analysis, as produced by lex-like lexers, when Integer tokens are assigned greater priority than Real tokens.

Figure 7 Lexical analysis, as produced by lex-like lexers, when Real tokens are assigned greater priority than Integer tokens.

Figure 8 Lexical analysis, as produced by Lamb, when Real and Integer tokens share priority value.

Figure 9 Correct syntactic analysis produced by applying an ambiguity-supporting parsing technique to the lexical analysis
graph produced by Lamb and shown in Figure 8.

    (-|\+)?[0-9]+            Integer
    (-|\+)?[0-9]+\.[0-9]+    Real
    \.                       Point
    \/                       Slash
    \&                       Ampersand

Figure 10 Regular expressions and token names in the specification of
our ambiguous language.

• Ampersand Integer Point Integer Ampersand
Slash Integer Point Integer Slash

• Ampersand Integer Point Integer Ampersand
Slash Real Slash

• Ampersand Real Ampersand Slash Integer Point
Integer Slash

• Ampersand Real Ampersand Slash Real Slash

Figure 11 Different possible token sequences in an input string due to
the lexically-ambiguous language specification in Figure 10.

    E ::= A B
    A ::= Ampersand Real Ampersand
    B ::= Slash Integer Point Integer Slash

Figure 12 Context-sensitive syntactic rules to resolve lexical
ambiguities.
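
How the syntactic rules discard the invalid alternatives can be sketched in Python by checking each of the four token sequences of Figure 11 against the rules of Figure 12. The recognizer below is a plain top-down one chosen for brevity, not the proof of concept parser of Figure 4, and it assumes a grammar without left recursion; `derives` and `accepts` are illustrative names.

```python
# Syntactic rules of Figure 12: nonterminal -> list of alternative bodies.
RULES = {
    "E": [["A", "B"]],
    "A": [["Ampersand", "Real", "Ampersand"]],
    "B": [["Slash", "Integer", "Point", "Integer", "Slash"]],
}

# The four alternative token sequences of Figure 11, in the same order.
ALTERNATIVES = [
    ["Ampersand", "Integer", "Point", "Integer", "Ampersand",
     "Slash", "Integer", "Point", "Integer", "Slash"],
    ["Ampersand", "Integer", "Point", "Integer", "Ampersand",
     "Slash", "Real", "Slash"],
    ["Ampersand", "Real", "Ampersand",
     "Slash", "Integer", "Point", "Integer", "Slash"],
    ["Ampersand", "Real", "Ampersand", "Slash", "Real", "Slash"],
]


def derives(symbol, tokens, pos=0):
    """Yield each end position reachable by deriving symbol at tokens[pos:]."""
    if symbol not in RULES:  # terminal: must equal the next token type
        if pos < len(tokens) and tokens[pos] == symbol:
            yield pos + 1
        return
    for body in RULES[symbol]:
        ends = {pos}
        for part in body:
            ends = {e2 for e in ends for e2 in derives(part, tokens, e)}
        yield from ends


def accepts(tokens):
    """True if the start symbol E derives the whole token sequence."""
    return len(tokens) in derives("E", tokens)


valid = [seq for seq in ALTERNATIVES if accepts(seq)]
for seq in valid:
    print(" ".join(seq))
```

Only the third alternative, in which the number between ampersands is a Real and the number between slashes is an Integer Point Integer sequence, is accepted.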

Even though statistical models such as Hidden Markov Models may produce
correct results in similar situations, they cannot be used for this
kind of language specification, where the specification states how each
token is to be recognized. Besides, their results may not always be
accurate, which makes it difficult to formally prove their correctness
in a well-defined setting.

V. CONCLUSIONS 

We have presented Lamb, a lexical analyzer that supports lexical
ambiguities. It performs a lexical analysis that efficiently captures
all the possible sequences of tokens for lexically-ambiguous languages
and it generates a lexical analysis graph that describes them all. Lamb
supports assigning priorities to tokens as traditional techniques do
but, in contrast to them, it does not require these priorities to be
set and it allows priority values to be shared. Tokens with shared
priorities are considered valid alternatives instead of
mutually-exclusive options.

The lexical analysis graph can then be fed as input to a parser, which
will discard any sequence of tokens that does not produce a valid
syntactic analysis. In summary, our proposal performs a
context-sensitive lexical analysis guided by syntactic rules and
supports lexically-ambiguous language specifications.